Packages Installation

# Make sure, that StringsAsFactors = FALSE
options(stringsAsFactors = F)

if(!require(install.load)){
  install.packages("install.load")
  library(install.load)
}

install_load("tidyverse", "moments", "plotly", "data.table", "fitdistrplus")
library(tidyverse)
library(moments)
library(plotly)
library(data.table)
library(fitdistrplus)

General Task

Task 1

Data Import

In this task, we have to create a distribution of logistic delay of component K7. Two data sets were provided, which are Komponente_K7.csv, which contains production date, and Logistikverzug_K7.csv, which contains incoming date of the product. To start analyzing the data, we must import the data sets and set their column types accordingly.

# Ror semicolon delimited
K7 <- read_csv2("Data/Logistikverzug/Komponente_K7.csv")[2:6]
K7$Fehlerhaft <- as.logical(K7$Fehlerhaft)
K7$Produktionsdatum <- as.Date(K7$Produktionsdatum,
                                          format = "%Y-%m-%d")  
# For comma delimited
LK7 <- read_csv("Data/Logistikverzug/Logistikverzug_K7.csv")[2:6]
LK7$Fehlerhaft <- as.logical(LK7$Fehlerhaft)
LK7$Wareneingang <- as.Date(LK7$Wareneingang,
                                          format="%Y-%m-%d")

Data Preparation

After importing the data, we check if the column Fehlerhaft in both data sets are equal.

Result<- K7$Fehlerhaft == LK7$Fehlerhaft
Result[FALSE]
## logical(0)

Based on this result, the column Fehlerhaft are in both data sets equal. Now, both data sets K7 and LK7 can be combined by using full_join(). A column with the name Wochentag is created for the day of production date. Afterwards, all NA values in the column Produktionsdatum and Wareneingang are filtered. A column Warenausgang is created, if the day of production date is Friday and Saturday. If the production day is Friday and Saturday, the product can be sent only on Monday, hence the production date is added by 3 or 2 respectively. For all other days, this date is added by 1.

logistics_delay <- K7 %>% 
  full_join(LK7, by = c("IDNummer","Herstellernummer", "Werksnummer","Fehlerhaft")) %>%
  mutate(Wochentag = weekdays(Produktionsdatum)) %>%  # Creates a new column Wochentag
  filter(!is.na(Produktionsdatum)) %>% # Filter NA values
  filter(!is.na(Wareneingang)) %>% # Filter NA values
  mutate(Warenausgang = ifelse(Wochentag == "Freitag",
                               Produktionsdatum +3,
                               ifelse(Wochentag == "Samstag",
                                      Produktionsdatum +2,
                                      Produktionsdatum +1))) 

The column Warenausgang can now be formatted in date-format. Logistic delay can then be calculated by substracting the incoming date (Wareneingang) with the outgoing date (Warenausgang).

logistics_delay <- logistics_delay %>%
  mutate(Warenausgang = as.Date(Warenausgang,
                                origin = "1970-01-01",
                                format="%Y-%m-%d")) %>% 
  mutate(Verzug = Wareneingang - Warenausgang)

Task 1.a - Logistics Distribution

To determine the distribution of the logistic delay, we use the function descdist from the package fitdistrplus. To use this, the column Verzug needs to be converted in numeric. Afterwards, the column is analyzed for discrete distribution, since the data contains integer values.

# Convert Verzug column in numeric
logistics_delay$Verzug <- as.numeric(logistics_delay$Verzug) 

# Determine the logistics distribution
descdist(logistics_delay$Verzug, discrete = TRUE)

## summary statistics
## ------
## min:  3   max:  14 
## median:  6 
## mean:  6.080441 
## estimated sd:  1.012306 
## estimated skewness:  0.567401 
## estimated kurtosis:  3.630036

According to the Cullen and Frey graph, the logistics data follows negative binomial distribution. Based on the fact that the skewness value is positive and the mean value is slightly larger than median, the distribution is positively skewed (Source: https://www.vrcbuzz.com/karl-pearsons-coefficient-of-skewness-using-r-with-examples/).

The skewness value is to analyze the symmetrie or the lack of symmetrie of the distribution. In this case it is possible to observe that this is negative. The value is -0,84. This means that the tail is on the left side of the distribution. The kurtosis value is greather than 3, it is said to be leptokurtic.This means that there are more chances to be outliers and that the distribution is peakead and haslong tails (thick)

summary(logistics_delay$Verzug)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    6.08    7.00   14.00
#skew(logistics_delay$Verzug)
mean <- mean(logistics_delay$Verzug)
median <- median(logistics_delay$Verzug)
stdev <- sd(logistics_delay$Verzug)
skewness <-3*(mean-median)/stdev #Pearson’s Coefficient of Skewness using the median
kurtosis<-kurtosis(logistics_delay$Verzug)
cat("skewness: ",skewness)
## skewness:  0.2383905
cat("kurtosis: ",kurtosis)
## kurtosis:  3.630006
#https://www.vrcbuzz.com/karl-pearsons-coefficient-of-skewness-using-r-with-examples/

Task 1. b - Logistics Delay Mean

To calculate the mean, following code is implemented.

mean <- as.numeric(mean(logistics_delay$Verzug)) 

# Ausgeben der Ergebnisse
cat("Mean = ", mean)
## Mean =  6.080441

Task 1.c - Visualization

binwith <-(14-1)/14
logistics_delay <-ggplot(logistics_delay, aes(x = Verzug ))+
  geom_histogram(aes(y = stat(density)),colour="black", fill="white", binwidth=1)+
  scale_x_continuous(breaks = c(0:14))+
  geom_density( fill="#FF6666",adjust = 10,alpha = 0.5) 


ggplotly(logistics_delay)

Task 1.d - Decision Tree

To create a decision tree, its important to know what and how many attributes there are, that influence the defect of component K7. Based on the data set K7, there are four columns that could be relevant. These are IDNummer, Produktionsdatum, Herstellernummer and Werksnummer. It could be noticed, that the IDNummer, Hersteller, and Werksnummer contain similar information. Herstellernummer and Werksnummer are directly connected, hence both should not be considered as different attribute. Inside the IDNummer, Herstellernummer and Werksnummer are also included and followed by the serie number of production. Hence, the IDNummer also should not be distinguished as an extra attribute. Based on this, the defect of component K7 is attributed solely from Produktionsdatum and Herstellernummer. The data set could be summarized as follow:

K7_decision_tree <- K7 %>% 
  dplyr::select(Produktionsdatum, Herstellernummer, Fehlerhaft) %>% 
  arrange(desc(Fehlerhaft))

head(K7_decision_tree)
## # A tibble: 6 × 3
##   Produktionsdatum Herstellernummer Fehlerhaft
##   <date>                      <dbl> <lgl>     
## 1 2011-02-02                    114 TRUE      
## 2 2011-06-02                    114 TRUE      
## 3 2013-10-29                    112 TRUE      
## 4 2014-01-14                    112 TRUE      
## 5 2014-06-02                    112 TRUE      
## 6 2014-11-20                    112 TRUE
tail(K7_decision_tree)
## # A tibble: 6 × 3
##   Produktionsdatum Herstellernummer Fehlerhaft
##   <date>                      <dbl> <lgl>     
## 1 2016-11-12                    112 FALSE     
## 2 2016-11-12                    112 FALSE     
## 3 2016-11-12                    112 FALSE     
## 4 2016-11-12                    112 FALSE     
## 5 2016-11-12                    112 FALSE     
## 6 2016-11-13                    112 FALSE

The next step would be to determine, which of the attributes can be considered as root characteristics. To do this, we have to measure Information Gain value \(IG(D,A)\) by applying each attribute A of both Produktionsdatum and Herstellernummer to the destination attribute D, which is in this case Fehlerhaft . This can be calculated by substracting \(H(D)\), which is the entropy of the datset of destination attribute D, with \(Rest(A)\), which is the remaining entropy that is still there after testing attribute A.

To calculate entropy of the dataset D, following equation is used, whereby the variable \(q\) describes the proportion of defective components to the total number of components:

Afterwards, we can calculate the remaining entropy of each attributes (Produktionsdatum and Herstellernummer) by using following equation:

For this, \(p_{k}\) and \(n_{k}\) corresponds with the number of defective and non-defective components in correlation with each component of corresponding attributes. \(q_{k}\) can be calculated by dividing \(p_{k}\) with the sum of \(p_{k}\) and \(n_{k}\). By using this equation, we can calculate the remaining entropy for both Produktionsdatum - \(Rest(Produktionsdatum)\) - and Herstellersnummer - \(Rest(Herstellersnummer)\) and hence determine the Information Gain for both \(IG(D, Produktionsdatum\) and \(IG(D, Produktionsdatum)\). The attribute with the largest information gain would then be selected as the root attribute and the other as the sub-attribute. The decision tree may look as follow.

Task 2 Data Base Structure

Task 3 Components

#reading txt file 
T16 <-file("Data/Einzelteil/Einzelteil_T16.txt",open = "r")
lines <- readLines(T16 )%>%str_replace_all(., "\t", "\n")%>%str_replace_all(., " \\| \\| ", "\t")
file <-fread(text =lines)
#reading zulassung file
zulassung <- read_csv2("Data/Zulassungen/Zulassungen_alle_Fahrzeuge.csv")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## New names:
## Rows: 3204104 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (2): IDNummer, Gemeinden dbl (1): ...1 date (1): Zulassung
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
#filtering Zulassung 
zulassung_adel=zulassung%>%filter(Gemeinden =="ADELSHOFEN")
#
#reading T16 Teil Komponente

#for semicolon delimited
T16_Kom_1 <- read_csv2("Data/Komponente/Bestandteile_Komponente_K2LE2.csv")%>%dplyr::select("ID_T16","ID_K2LE2")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## New names:
## Rows: 163769 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (4): ID_T16, ID_T19, ID_T20, ID_K2LE2 dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
names(T16_Kom_1)[2]<-"ID_Sitze"
T16_Kom_2 <- read_csv2("Data/Komponente/Bestandteile_Komponente_K2ST2.csv")%>%dplyr::select("ID_T16","ID_K2ST2")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## New names:Rows: 655075 Columns: 5── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (4): ID_T16, ID_T17, ID_T18, ID_K2ST2
## dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(T16_Kom_2)[2]<-"ID_Sitze"
#reading Komponente & Combine them 
#Kom_K2LE2 <-file("Data/Komponente/Komponente_K2LE2.txt",open = "r")
#K2LE2_L <- readLines(T16 )%>%str_replace_all(., " \ ", "\t")
#K2LE2 <-fread(text =lines)

Kom_K2LE2<- read.delim("Data/Komponente/Komponente_K2LE2.txt",sep = "\\")%>%dplyr::select(c("ID_Sitze","Werksnummer")) # not need to read it doesnt give any extra info
Kom_K2ST2<-read_csv2("Data/Komponente/Komponente_K2ST2.csv")%>%dplyr::select(c("ID_Sitze","Werksnummer"))
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## New names:
## Rows: 655075 Columns: 10
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (2): ID_Sitze, origin dbl (7): ...1, X1, Herstellernummer, Werksnummer,
## Fehlerhaft, Fehlerhaft_Fa... date (1): Fehlerhaft_Datum
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
T16_K2LE2<-T16_Kom_1%>%left_join(Kom_K2LE2,by="ID_Sitze")#c("ID_Kom"="ID_Sitze")
T16_K2ST2<-T16_Kom_2%>%left_join(Kom_K2ST2,by="ID_Sitze")
T16_Kom_bind<-rbind(T16_K2LE2,T16_K2ST2)
#read Bestandteil Fahrzeuge and combine them
Kom_Fahr_1<-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ21.csv")%>%dplyr::select(c("ID_Sitze","ID_Fahrzeug"))
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## New names:
## Rows: 512354 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (5): ID_Karosserie, ID_Schaltung, ID_Sitze, ID_Motor, ID_Fahrzeug dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Kom_Fahr_2<-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ22.csv")%>%dplyr::select(c("ID_Sitze","ID_Fahrzeug"))
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 306490 Columns: 6── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (5): ID_Karosserie, ID_Schaltung, ID_Sitze, ID_Motor, ID_Fahrzeug
## dbl (1): X1
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Kom_Fahr_bind<-rbind(Kom_Fahr_1,Kom_Fahr_2)
#join the different datasets to create a dataset that only contains the IDs of the components that where produced in ADELSHOFEN
Kom_Fahr <- T16_Kom_bind %>%left_join(Kom_Fahr_bind, by ="ID_Sitze")
Kom_Fahr_Adel <-Kom_Fahr%>%inner_join(zulassung_adel,by =c("ID_Fahrzeug"="IDNummer"))

Number of Cars register in ADELSHOFEN:

nrow(Kom_Fahr_Adel)
## [1] 48

Task 4 Data Type Attributes

The Data type of the table “Zulassung” is shown with the str() function

str(zulassung)
## spec_tbl_df [3,204,104 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1     : num [1:3204104] 408097 408098 1 2 3 ...
##  $ IDNummer : chr [1:3204104] "11-1-11-1" "11-1-11-2" "12-1-12-1" "12-1-12-2" ...
##  $ Gemeinden: chr [1:3204104] "DRESDEN" "DRESDEN" "LEIPZIG" "LEIPZIG" ...
##  $ Zulassung: Date[1:3204104], format: "2009-01-01" "2009-01-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   IDNummer = col_character(),
##   ..   Gemeinden = col_character(),
##   ..   Zulassung = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>

Task 5 Server

To Save the records on the database of a server has the advantage that no local memory is occupied but more important, a permanent access to the most recently data records can be in most cases guaranteed, with a few exception as for example when maintenance takes place.

Since databases tend to store a large amount of data, this could cause the computer to slow down. Also the quality of the data deteriorate very fast, since the data doesnt get updated.

“RStudio Shiny Server Open Source” allows customers to get access to the app from anywhere with any web browser

Task 6 Car Rsegistration

car<-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ12.csv")%>%dplyr::select(c("ID_Karosserie","ID_Fahrzeug"))%>%filter(ID_Karosserie == "K5-112-1122-79")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## New names:
## Rows: 408096 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (5): ID_Karosserie, ID_Schaltung, ID_Sitze, ID_Motor, ID_Fahrzeug dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
car_registerd<-filter(zulassung, IDNummer == car$ID_Fahrzeug)
gemeinden <- as.character(car_registerd$Gemeinden)
cat("Car with ID Karosserie'K5-112-1122-79' got registered in ",
    gemeinden) 
## Car with ID Karosserie'K5-112-1122-79' got registered in  ASCHERSLEBEN